The S&P 500 is an American stock market index based on the market capitalizations of 500 large companies with common stock listed on the NYSE, NASDAQ, or the Cboe BZX Exchange. It is one of the most widely followed equity indices and one of the best representations of the U.S. stock market. Many consider it a bellwether and leading indicator for the economy, as well as the default vehicle for passive investors who want exposure to the U.S. economy via index funds. Since 1957, the S&P 500 has performed remarkably well, outpacing other major asset classes such as bonds and commodities.

In this project, we examine models that help predict the performance of the S&P 500 index. Such models would allow investors (portfolio managers as well as individual investors) to gauge the general economic condition and investment climate given current market conditions.

We will include both quantitative and qualitative inputs in the model. Quantitative inputs include the Fama-French three-factor variables (SMB, HML, and the market risk premium), macroeconomic indicators (e.g. US GDP, unemployment rate, inflation index), technical analysis indicators (e.g. relative strength index, Z-score, accumulation) and indices for other financial assets or sectors (bond prices, emerging-market indices). Apart from these, market sentiment, obtained from tweets, will be used as a qualitative input.

The first model we have developed nowcasts the current US GDP. Real GDP is widely recognized as the best economic indicator for measuring the state of the economy, and thus it has a huge impact on the S&P 500. However, GDP figures for the current quarter are only released three months after the end of the quarter, limiting investors' decision making. Therefore, it would be of great value to nowcast the current GDP and use it in our subsequent regression analysis.

The second model is a text analytics model, in which we conduct sentiment analysis on tweets from 300 influential finance Twitter accounts. With the resulting sentiment values, we can use market sentiment information to predict S&P 500 performance.

Then, we use regression to determine the model that gives the highest accuracy in predicting the S&P 500.

Lastly, the team will simulate possible economic conditions in 2019 and use our final model to assess the likelihood of a market crash or boom in 2019.

Nowcasting the current GDP of the United States of America

Motivation: Real GDP is widely recognized as the best economic indicator for measuring the state of the economy. Consequently, GDP figures can have a large impact on the stock market. However, GDP figures for the current quarter are only released three months after the end of the quarter, limiting investors' decision making. Other economic indicators, though, are released monthly, and the information they provide may be useful in informing us of the current state of the economy.

The first step involves the collecting and processing of data for the various economic indicators.

library(timeDate)
library(DataCombine)

#read in multiple csv files
temp = list.files("./Nowcast Data", pattern="*.csv")
myfiles = lapply(temp, function(x) read.csv(paste("./Nowcast Data/", x, sep=""),stringsAsFactors = FALSE))

names(myfiles) <- gsub(".csv","",
                       list.files("./Nowcast Data",full.names = FALSE),
                       fixed = TRUE)

#function to convert string data into numeric values and align the dates of the various economic indicators
readAndClean <- function(originalData) {
  
  histData <- originalData
  histData[ ,c('Forecast', 'Time')] <- list(NULL)
  histData$Actual <- as.numeric(gsub(pattern = '%|M|K',replacement = '',x = histData$Actual))
  histData$Month <- match(substr(x = histData$Release.Date,start = 1,stop=3),month.abb)-1
  histData$Year <- ifelse(histData$Month==0,as.numeric(substr(x = histData$Release.Date,start=9,stop=13))-1,as.numeric(substr(x = histData$Release.Date,start=9,stop=13)))
  histData$Month <- ifelse(histData$Month==0,histData$Month+12,histData$Month)
  histData$Month <- ifelse(histData$Month<10,paste0("0",histData$Month),histData$Month)
  histData$Date <- timeLastDayInMonth(as.Date(as.character(paste0("01-",histData$Month,"-",histData$Year)),"%d-%m-%Y"))
  
  histData <- histData[,-c(1,3,4)]
  histData <- histData[c(2,1)]
  #Blank out the variable name so that cbind() later prefixes each column
  #with the name of the corresponding list element (e.g. "CapUtilize.")
  VariableName <- deparse(substitute(originalData))
  variableName <- sub('.*', '', VariableName)
  colnames(histData) <- c(paste0(variableName,"_date"),variableName)
  return (histData)
}

df <- lapply(myfiles,readAndClean)
df <- do.call("cbind",df)

#To be consistent in units
df$CapUtilize. <- df$CapUtilize./100
df$CoreCPI. <- df$CoreCPI./100
df$CorePPI. <- df$CorePPI./100
df$HourlyEarnings. <- df$HourlyEarnings./100
df$RetailSales. <- df$RetailSales./100
df$Unemployment. <- df$Unemployment./100


#Create a new data frame to hold economic variables that are already recorded in terms of MoM change
newdf <- df[,c("CapUtilize._date","CoreCPI.","CorePPI.","HourlyEarnings.","RetailSales.")]
names(newdf)[names(newdf) == 'CapUtilize._date'] <- 'CoverDate'

#Refers to economic indicators that have to be transformed to MoM change
colToChange <- c("CapUtilize.","Unemployment.","BizOutlook.","ConsumerSentiment.","HousingStarts.","ISMPMI.","NFP.")

for (i in 1:7) {
  changedDf <- PercChange(df,Var=colToChange[i],type="percent",slideBy=1)
  newdf <- cbind(newdf,changedDf[,ncol(changedDf)])
  names(newdf)[ncol(newdf)] = colToChange[i]
  newdf[,ncol(newdf)] <- newdf[,ncol(newdf)]/100
}
## Warning: PercChange is depricated. Please use change.
## 
## Remember to put data in time order before running.
## 
## Leading CapUtilize. by 1 time units.
## Warning: PercChange is depricated. Please use change.
## 
## Remember to put data in time order before running.
## 
## Leading Unemployment. by 1 time units.
## Warning: PercChange is depricated. Please use change.
## 
## Remember to put data in time order before running.
## 
## Leading BizOutlook. by 1 time units.
## Warning: PercChange is depricated. Please use change.
## 
## Remember to put data in time order before running.
## 
## Leading ConsumerSentiment. by 1 time units.
## Warning: PercChange is depricated. Please use change.
## 
## Remember to put data in time order before running.
## 
## Leading HousingStarts. by 1 time units.
## Warning: PercChange is depricated. Please use change.
## 
## Remember to put data in time order before running.
## 
## Leading ISMPMI. by 1 time units.
## Warning: PercChange is depricated. Please use change.
## 
## Remember to put data in time order before running.
## 
## Leading NFP. by 1 time units.

The next step involves aggregating monthly data into quarterly frequency in order to perform a regression against the quarterly GDP later on. We used the bridge equation detailed in a BOJ paper (https://www.boj.or.jp/en/research/wps_rev/wps_2018/data/wp18e18.pdf) to aggregate the monthly data. The aggregating method is just a linear combination of the lagged values for that particular indicator.
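Expressed as an equation (assuming, as in the data frames above, that rows are ordered from newest to oldest), the aggregation applied in the loop below for a quarter-end month t is

x_t^Q = (1/3) * (x_t + 2*x_{t-1} + 3*x_{t-2} + 2*x_{t-3} + x_{t-4}),

where x_t is the month-on-month change in month t and x_{t-1}, …, x_{t-4} are its four monthly lags. This is the triangular-weight bridge aggregation described in the BOJ paper; note that it requires four further months of data beyond each quarter end.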

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
newdf1 <- newdf

for (i in 1:nrow(newdf1)) {
  if ((month(newdf1$CoverDate[i]) == 3 | month(newdf1$CoverDate[i]) == 6 | month(newdf1$CoverDate[i]) == 9 | month(newdf1$CoverDate[i]) == 12) & year(newdf1$CoverDate[i])>2013)  {
    for (j in 2:12) {
      newdf1[i,j] <- 1/3*(newdf1[i,j] + 2*newdf1[i+1,j] + 3*newdf1[i+2,j] + 2*newdf1[i+3,j] + newdf1[i+4,j])
    }
  }
}

After the aggregation, we can remove the monthly data which are not at the quarter end. We consider the data from 2014 - 2018 for our regression. Intuitively, the various economic indicators should have a moderate to strong correlation, considering that they all serve to measure the underlying economic activity. We plot the correlation chart to confirm our intuition.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following objects are masked from 'package:timeDate':
## 
##     kurtosis, skewness
## The following object is masked from 'package:graphics':
## 
##     legend
predictors <- newdf1 %>% 
  select(-1) %>%
    slice(seq(4,61,3))

chart.Correlation(predictors, histogram=TRUE)

chart.Boxplot(predictors,ylim=c(-0.5,1.5))

Generating Principal Components

Although the correlation chart did not show strong correlations across all indicators, several pairs were moderately to strongly correlated. As such, we will extract principal components to avoid multicollinearity when we perform our regression later on. Additionally, the boxplot revealed the outsized variances of some economic indicators. Hence, we will use a standardized principal component analysis.

PCAValues <- prcomp(x = predictors, scale. = TRUE, center = TRUE)
PCAData <- unclass(PCAValues)

eigenVectors <- as.data.frame(PCAData[["rotation"]])

biplot(PCAValues,scale=0)

#The biplot reveals that retail sales, housing starts, the purchasing managers' index (PMI), business outlook and hourly earnings strongly influence PC1. This suggests that PC1 captures business conditions, since the first four predictors reflect the sales volume of businesses, and higher hourly earnings growth increases businesses' operating costs. 

#The second principal component is probably related to labour market conditions, where higher factory capacity utilization means lower unemployment.

summary(PCAValues)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     1.8095 1.5214 1.2967 1.00608 0.9119 0.77545 0.72349
## Proportion of Variance 0.2977 0.2104 0.1529 0.09202 0.0756 0.05467 0.04759
## Cumulative Proportion  0.2977 0.5081 0.6609 0.75296 0.8286 0.88322 0.93081
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.57147 0.46119 0.35596 0.30838
## Proportion of Variance 0.02969 0.01934 0.01152 0.00865
## Cumulative Proportion  0.96050 0.97984 0.99135 1.00000
screeplot(PCAValues)

The scree plot and a more lenient application of the Kaiser rule suggest that the first 5 principal components explain most of the variance. Therefore, we will perform a linear regression of real GDP on the first 5 principal components.
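As a quick sanity check, the Kaiser-rule counts can be reproduced from the component standard deviations printed by summary(PCAValues) above (a small sketch using those printed values rather than the PCAValues object itself):

```r
#Eigenvalues are the squared standard deviations of the principal components
#(values reproduced from summary(PCAValues) above)
sdev <- c(1.8095, 1.5214, 1.2967, 1.00608, 0.9119, 0.77545, 0.72349,
          0.57147, 0.46119, 0.35596, 0.30838)
eigenvalues <- sdev^2

sum(eigenvalues > 1)    #strict Kaiser rule (eigenvalue > 1) keeps 4 components
sum(eigenvalues > 0.8)  #a more lenient cutoff also admits PC5, giving 5
```

The strict rule keeps only 4 components; loosening the cutoff slightly admits PC5 as well, which is the "more lenient" reading used here.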

#Transform our initial economic indicators into z-scores as we used standardized PCA
zpredictors <- as.data.frame(scale(x = predictors,center = TRUE,scale = TRUE))

#Matrix multiplication with eigenvectors to obtain a time series of the principal components
TimeSeries <- as.matrix(zpredictors) %*% as.matrix(eigenVectors) 
predictorsTimeSeries <- as.data.frame(TimeSeries)

#Read in and store the dependent variable, i.e. GDP
GDPData <- read.csv("GDP.csv",header = TRUE,stringsAsFactors = FALSE)
GDPData[ ,c('Forecast', 'Time')] <- list(NULL)
GDPData$Actual <- as.numeric(gsub(pattern = '%',replacement = '',x = GDPData$Actual))/100
GDPData$Release.Date <- NULL
GDPData <- scale(GDPData,center = TRUE,scale = TRUE)

fullTimeSeries <- cbind(GDPData,predictorsTimeSeries)
model <- lm(Actual~PC1 + PC2 + PC3 + PC4 + PC5,fullTimeSeries)
summary(model)
## 
## Call:
## lm(formula = Actual ~ PC1 + PC2 + PC3 + PC4 + PC5, data = fullTimeSeries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34472 -0.22318  0.08653  0.65270  1.07565 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  9.351e-17  2.174e-01   0.000   1.0000  
## PC1         -2.631e-01  1.232e-01  -2.135   0.0509 .
## PC2          5.477e-02  1.466e-01   0.374   0.7143  
## PC3         -7.227e-02  1.720e-01  -0.420   0.6807  
## PC4         -1.201e-01  2.217e-01  -0.542   0.5965  
## PC5         -2.370e-01  2.445e-01  -0.969   0.3488  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9721 on 14 degrees of freedom
## Multiple R-squared:  0.3037, Adjusted R-squared:  0.05507 
## F-statistic: 1.221 on 5 and 14 DF,  p-value: 0.3498
#Unfortunately, the model with the first 5 principal components as predictors shows unacceptable results, with an adjusted R-squared of 0.05507 and a p-value of 0.3498.


#We next try a stepwise regression on the principal components to see if we can achieve a model with better accuracy.
stepWisePCA <- step(model)
## Start:  AIC=3.73
## Actual ~ PC1 + PC2 + PC3 + PC4 + PC5
## 
##        Df Sum of Sq    RSS    AIC
## - PC2   1    0.1319 13.361 1.9320
## - PC3   1    0.1669 13.396 1.9842
## - PC4   1    0.2774 13.506 2.1485
## - PC5   1    0.8879 14.117 3.0327
## <none>              13.229 3.7335
## - PC1   1    4.3070 17.536 7.3704
## 
## Step:  AIC=1.93
## Actual ~ PC1 + PC3 + PC4 + PC5
## 
##        Df Sum of Sq    RSS    AIC
## - PC3   1    0.1669 13.528 0.1802
## - PC4   1    0.2774 13.638 0.3429
## - PC5   1    0.8879 14.249 1.2187
## <none>              13.361 1.9320
## - PC1   1    4.3070 17.668 5.5203
## 
## Step:  AIC=0.18
## Actual ~ PC1 + PC4 + PC5
## 
##        Df Sum of Sq    RSS     AIC
## - PC4   1    0.2774 13.805 -1.4139
## - PC5   1    0.8879 14.416 -0.5484
## <none>              13.528  0.1802
## - PC1   1    4.3070 17.835  3.7084
## 
## Step:  AIC=-1.41
## Actual ~ PC1 + PC5
## 
##        Df Sum of Sq    RSS     AIC
## - PC5   1    0.8879 14.693 -2.1673
## <none>              13.805 -1.4139
## - PC1   1    4.3070 18.112  2.0170
## 
## Step:  AIC=-2.17
## Actual ~ PC1
## 
##        Df Sum of Sq    RSS      AIC
## <none>              14.693 -2.16725
## - PC1   1     4.307 19.000  0.97413
summary(stepWisePCA)
## 
## Call:
## lm(formula = Actual ~ PC1, data = fullTimeSeries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39735 -0.36408 -0.01338  0.56518  1.20980 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  7.693e-17  2.020e-01   0.000   1.0000  
## PC1         -2.631e-01  1.145e-01  -2.297   0.0338 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9035 on 18 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.1837 
## F-statistic: 5.276 on 1 and 18 DF,  p-value: 0.03383
#This time, the stepwise regression retains only PC1, improving the adjusted R-squared to 0.1837 with a significant p-value of 0.03383, but the fit remains weak.

Unfortunately, even the best stepwise model explains too little of the variation in real GDP to support reliable nowcasts. As such, we decided to drop the nowcasting exercise and proceed with other models instead.

Text Analytics

In the following section, we employ text analytics techniques to investigate market sentiment and attempt to identify any relationship between sentiment and immediate changes in the S&P 500 index value.

Data Collection

Tweets from 300 Twitter accounts that publish content relevant to the financial market were gathered in preparation for the analysis. The list of accounts was compiled through online research; it includes major news outlets such as Reuters, institutions such as the Federal Reserve, and many financial experts, among others. The full list is stored in “TwitterAccount.csv”.

To start off, 3,000 tweets from each account are obtained and stored in a file named “TwitterFullData.csv” for further processing.

library(twitteR)
library(ROAuth)
#Setting up the Twitter developer account (credentials redacted; substitute your own API keys)
consumer_key <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
token <- "YOUR_ACCESS_TOKEN"
token_Secret <- "YOUR_ACCESS_TOKEN_SECRET"
setup_twitter_oauth(consumer_key, consumer_secret, token, token_Secret)

#Extracting raw tweets from identified accounts
KeyAcct <- read.csv("TwitterAccount.csv")
Tweets <- userTimeline(KeyAcct[1,1], n = 3000, includeRts = FALSE)
Tweet_Full <- twListToDF(Tweets)

for(i in 2:nrow(KeyAcct)){
  #Skip protected accounts; without the braces, the previous account's tweets
  #would be appended again whenever an account is protected
  if(getUser(KeyAcct[i,1])$protected == FALSE){
    KeyAcct_Tweet <- userTimeline(KeyAcct[i,1], n = 3000, includeRts = FALSE)
    Tweet_Full <- rbind(Tweet_Full, twListToDF(KeyAcct_Tweet))
  }
}

write.csv(Tweet_Full, file="TwitterFullData.csv")

Text Processing

Next, we clean the crawled tweets by first retaining only those posted between Jan 1st 2014 and Dec 31st 2018, our intended study period. The retained tweets are further filtered by comparing their content against a list of 238 financial terms. The full list of terms is stored in “EffectiveWords.csv”, and it contains terms such as “yield”, “risk”, and “financ”. Tweets that contain none of the 238 terms are discarded, since they are unlikely to carry information useful for our investigation of market sentiment. After the filtering is done, the final set of tweets is stored in “tweet_useful.csv”.

Tweet_Full <- read.csv("TwitterFullData.csv")
Tweet_Full <- Tweet_Full[,c("text","created")]

#Filter tweets within time period 01/01/2014 - 31/12/2018
Tweet_Full$created <- substr(Tweet_Full$created, 1, 10)
Tweet_Full$created <- as.Date(Tweet_Full$created, format = "%Y-%m-%d")
Tweet_Full <- subset(Tweet_Full, Tweet_Full$created >= as.Date("2014-01-01"))
Tweet_Full <- subset(Tweet_Full, Tweet_Full$created <= as.Date("2018-12-31"))

#Read in the effective words library to filter out irrelevant tweets
effective_words <- read.csv("EffectiveWords.csv", stringsAsFactors=FALSE, header = T)

#Create an empty data frame to store the useful tweets and their creation time
Tweet_useful <- data.frame(matrix(nrow = 0, ncol = 2))
colnames(Tweet_useful) <- c("text", "created")

#Go through tweet by tweet to identify the useful tweets that contain key words from the library
for(i in 1:nrow(Tweet_Full)){
  for(j in 1:nrow(effective_words)){
    if(length(grep(effective_words[j,1], Tweet_Full[i,1])) > 0){ 
      Tweet_useful <- rbind(Tweet_useful, Tweet_Full[i,])
      break
    }
  }
}

write.csv(Tweet_useful, file="tweet_useful.csv")

Some tweet snippets are shown below:

Tweet_useful <- read.csv("tweet_useful.csv")
head(Tweet_useful$text, 10)
##  [1] @joetrader6 @CNBC @WSJ @business @bpolitics @hblodget There are no such restrictions on elected officials                                     
##  [2] @jessefelder Jesse, how does this compare to U.S. and E.U.?                                                                                   
##  [3] @d0ubleorn0thing Could be completed, lead to brief FOMO, then roll over                                                                       
##  [4] @imarvindmahesh I do not have a trade on. It is an appraisal of the chart. An opinion is not a position.                                      
##  [5] Price action in #Cable could qualify as an "end-around," implying an advance back to 1.3800 $gbpusd https://t.co/shbkOdO7hk                   
##  [6] Many chartists view $BTC as a H&amp;S bottom. I am NOT among them. Higher probability is that bear market is not over. https://t.co/64BmxmNWyy
##  [7] When back testing a trading system make sure to use out-of-sample data in addition to in-sample data to confirm a t… https://t.co/snOb9JULYw  
##  [8] @HoustonMarck @SJosephBurns @traderstewie @TraderMentality @option_snipper @allstarcharts @AOTtrades Trading does n… https://t.co/2MpZVVprvX  
##  [9] @Grizzlyshort Lower risk during losing streaks, ramp back up when wins occur. I use a grid system to determine sizing/risk per trade          
## [10] @Secret_Profits A trading approach can be out of synch with markets -- this is normal the purpose of aggressive ris… https://t.co/dygLv7fqRL  
## 60053 Levels: _________\n| Frankly, what     |\n| this newsletter   |\n| has to say is a   |\n| very small part   |\n| of the pressure… https://t.co/9PklNy7jDI ...

Word Cloud

For an initial quick analysis of the text, we produce a word cloud displaying the most frequent words in our collected tweets. With the set of useful tweets, we first sort the tweets in chronological order by creation time, and then perform the necessary text processing.

Tweet_useful_sorted <- Tweet_useful[order(as.Date(Tweet_useful$created, format="%d/%m/%Y")),]

library(tm)
## Loading required package: NLP
Tweet_Text_Corpus <- Corpus(VectorSource(Tweet_useful_sorted$text))

#Text cleaning
RemoveURL <- function(x) gsub("(http://t.co)(.*)", "", x)
RemoveURLs <- function(x) gsub("(https://t.co)(.*)", "", x)
RemoveName <- function(x) gsub("@\\w+", "", x)
RemoveAmp <- function(x) gsub("amp;", "", x)
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(RemoveURL))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(RemoveURL)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(RemoveURLs))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(RemoveURLs)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(RemoveName))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(RemoveName)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(RemoveAmp))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(RemoveAmp)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, stripWhitespace):
## transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(tolower)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, removeWords,
## stopwords("english")): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, removeNumbers):
## transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, removePunctuation):
## transformation drops documents
#Text stemming
library(SnowballC)
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, wordStem, language = "english")
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, wordStem, language =
## "english"): transformation drops documents
#Identify the most frequently appeared words
Tweet_tdm <- TermDocumentMatrix(Tweet_Text_Corpus)
Tweet_RemoveSparse_tdm <- removeSparseTerms(Tweet_tdm, sparse = 0.999)
freq <- sort(rowSums(as.matrix(Tweet_RemoveSparse_tdm)), decreasing=TRUE)

head(freq, 50)
##       …  market    just    like    will    year     can trading     now 
##    5077    3179    2684    2508    2491    2397    2182    2140    2131 
##     one    time     new    good   today   think    last     day    week 
##    2105    2084    1988    1939    1740    1722    1707    1602    1572 
##   since     get     see   years   stock   short    high     low     spx 
##    1533    1496    1488    1481    1479    1406    1392    1351    1338 
##   trade  people  stocks      ’s   great    long   still    know    much 
##    1251    1247    1237    1203    1201    1163    1140    1131    1099 
##    back    next   icymi     big markets   right    also   calls   price 
##    1090    1031     988     973     962     961     957     951     950 
##   first     vix    make    call    many 
##     943     936     934     929     919

Looking at the most frequent words above gives us an idea of which words may require fixing or replacement.

#Fix or remove words of undesirable form
ReplaceWord <- function(x) gsub("\\bvol\\b", "volume", x) #word boundaries avoid corrupting words like "volume" itself
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
ReplaceWord <- function(x) gsub("…", "", x) 
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
ReplaceWord <- function(x) gsub("’s", "", x) 
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
ReplaceWord <- function(x) gsub("t…", "", x) 
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
ReplaceWord <- function(x) gsub("“", "", x) 
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
more.stop.words <- c("just", "time", "week", "weeks", "like", "day", "days", "years", "stocks", "will", "one", "can", "now", "look", "month", "months", "say", "way", "icymi", "next", "things", "thing", "markets", "also", "vix", "make", "many", "come", "two", "may", "yes", "name", "really", "actually", "yet", "even", "another", "always", "seem", "every", "got", "looks", "ago", "getting", "probably", "part", "might", "pretty",  "give", "idea", "twitter", "tweet", "live", "said", "guy", "something", "likely", "seems", "says", "said", "don’t", "ever", "maybe", "though", "almost", "times", "given", "morning", "seen", "’re", "someone", "everyone", "nothing", "yeah", "anyone", "todays", "per", "already", "saying", "aapl", "thank", "trying")

Tweet_Review_Corpus <- tm_map(Tweet_Text_Corpus, removeWords, more.stop.words)

#Display the word cloud to view the most frequently appeared words from the tweets
library(wordcloud)
## Loading required package: RColorBrewer
## 
## Attaching package: 'wordcloud'
## The following object is masked from 'package:PerformanceAnalytics':
## 
##     textplot
wordcloud(Tweet_Review_Corpus, min.freq = 450, max.words = Inf, 
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))

The word cloud attests to the content validity of the tweets we have gathered, as the most frequent words are generally finance-specific (such as fund, spx, and stock) or finance-related (such as trading, growth, and price). We are thus assured that further analysis of these tweets will produce relevant and useful results.

Sentiment Analysis

After processing the tweets, we are going to perform sentiment analysis on the relevant tweets. To conduct the analysis, we will be using the Syuzhet package, which comes with four sentiment dictionaries (“syuzhet”, “bing”, “afinn”, and “nrc”).

Using the get_sentiment function, we can get a sentiment score for each tweet. Note that different dictionaries use different scales.

library(syuzhet)
Tweet_Review_Data <- as.vector(t(data.frame(text = sapply(Tweet_Review_Corpus, as.character), stringsAsFactors = FALSE)))

#Get sentiment values using four different dictionaries
syuzhet_sentiment <- get_sentiment(Tweet_Review_Data, method = "syuzhet", language = "english")
bing_sentiment <- get_sentiment(Tweet_Review_Data, method = "bing", language = "english")
afinn_sentiment <- get_sentiment(Tweet_Review_Data, method = "afinn", language = "english")
nrc_sentiment <- get_sentiment(Tweet_Review_Data, method = "nrc", language = "english")

We combine the scores from the different dictionaries into one data frame, sentiment_values, and compare the pairwise correlation coefficients among the dictionary methods. The results show that the syuzhet dictionary correlates most strongly with the other dictionaries, while the correlations between nrc and bing and between nrc and afinn are relatively low.

#Combine different scores into a data frame, and compare pairwise correlation coefficients among different dictionary methods
sentiment_values <- data.frame(syuzhet=syuzhet_sentiment, bing=bing_sentiment, afinn=afinn_sentiment, nrc=nrc_sentiment)
cor(sentiment_values)
##           syuzhet      bing     afinn       nrc
## syuzhet 1.0000000 0.7216827 0.6903409 0.6792909
## bing    0.7216827 1.0000000 0.6625483 0.4687012
## afinn   0.6903409 0.6625483 1.0000000 0.4501116
## nrc     0.6792909 0.4687012 0.4501116 1.0000000

Monthly Sentiment Values over Time

We also add the date each tweet was posted, together with a week number and a month number. As our analysis covers 2014 to 2018, week number 1 corresponds to tweets posted in the first week of 2014, and month number 1 to tweets posted in January 2014.

We calculate the monthly mean sentiment values under different dictionaries and plot the monthly sentiment values over time in the graphs below. The first plot aggregates the sentiment trends from four dictionaries to have a better comparison among dictionaries.

From the plots, we can see that the series generally reflects market sentiment. For example, between Nov 2015 and Feb 2016 (months 23 to 26), the sentiment values are at the low end, which can plausibly be explained by the Chinese stock market turbulence that dented overall market confidence. Sentiment values in 2018 are at the relatively high end, as the S&P 500 kept making record highs during this period.

#Add in date as well as week number and month number (starting reference point 29/12/2013)
sentiment_values$date <- as.Date(Tweet_useful_sorted$created,"%d/%m/%Y")

sentiment_values$week <- as.numeric(floor(difftime(sentiment_values$date, as.Date("29/12/2013","%d/%m/%Y"), units = "weeks")))+1
sentiment_values$week <- as.factor(sentiment_values$week)

sentiment_values$month <- (as.numeric(substr(sentiment_values$date,1,4))-2014)*12+as.numeric(substr(sentiment_values$date,6,7))
sentiment_values$month <- as.factor(sentiment_values$month)

#Time series plot: Monthly Sentiment Values over Time
sentiment_time_series_byMonth <- 
  data.frame(month=1:60, syuzhet=tapply(sentiment_values$syuzhet, sentiment_values$month, mean),
             bing=tapply(sentiment_values$bing, sentiment_values$month, mean),
             afinn=tapply(sentiment_values$afinn, sentiment_values$month, mean),
             nrc=tapply(sentiment_values$nrc, sentiment_values$month, mean))

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
ggplot(data = sentiment_time_series_byMonth) + geom_line(aes(x = month, y = syuzhet), color = "red") + geom_line(aes(x = month, y = bing), color = "blue") + geom_line(aes(x = month, y = afinn), color = "green") + geom_line(aes(x = month, y = nrc), color = "black") + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time") + theme(legend.position = "none")

ggplot(data = sentiment_time_series_byMonth, aes(x = month, y = syuzhet)) + geom_line(size = 1) + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time (syuzhet method)")

ggplot(data = sentiment_time_series_byMonth, aes(x = month, y = bing)) + geom_line(size = 1) + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time (bing method)")

ggplot(data = sentiment_time_series_byMonth, aes(x = month, y = afinn)) + geom_line(size = 1) + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time (afinn method)")

ggplot(data = sentiment_time_series_byMonth, aes(x = month, y = nrc)) + geom_line(size = 1) + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time (nrc method)")

S&P 500 Prediction using Sentiment Values

We would like to use the weekly average sentiment value to predict whether the S&P 500 moved up or down during that week. If the sentiment value is above a certain threshold, we classify the average sentiment as positive and predict that the S&P 500 moved up; conversely, if it is below the threshold, we classify the sentiment as negative and predict that the index moved down.

A function, findThreshold, is developed to find, for each dictionary method, the classification threshold that maximises the prediction accuracy computed from the confusion matrix.

weekly_return <- read.csv("SPXWeeklyReturn.csv")
sentiment_time_series_byWeek <- 
  data.frame(week=1:262, updown = ifelse(weekly_return$Return>0,"Up","Down"),
             return=weekly_return$Return, 
             syuzhet=tapply(sentiment_values$syuzhet, sentiment_values$week, mean),
             bing=tapply(sentiment_values$bing, sentiment_values$week, mean),
             afinn=tapply(sentiment_values$afinn, sentiment_values$week, mean),
             nrc=tapply(sentiment_values$nrc, sentiment_values$week, mean))

#function to find the best threshold with highest accuracy
findThreshold <- function(method){
  best_threshold <- 0
  best_accuracy <- 0

  if (method=="syuzhet"){
    data <- syuzhet_sentiment
  } else if (method=="bing"){
    data <- bing_sentiment
  } else if (method=="afinn"){
    data <- afinn_sentiment
  } else if (method=="nrc"){
    data <- nrc_sentiment
  } else {
    return()
  }
  
  for (i in seq(min(data),max(data),0.01)){
    table <- table(sentiment_time_series_byWeek$updown == "Up", sentiment_time_series_byWeek[,method]>i)
    if (dim(table)[2]==1)
      next
    accuracy <- (table[1,1]+table[2,2])/sum(table)
    if (accuracy>best_accuracy){
      best_threshold <- i
      best_accuracy <- accuracy
    }
  }
  return(c(threshold=best_threshold,accuracy=best_accuracy))
}

syuzhet_threshold <- findThreshold("syuzhet")
bing_threshold <- findThreshold("bing")
afinn_threshold <- findThreshold("afinn")
nrc_threshold <- findThreshold("nrc")
syuzhet_threshold
##  threshold   accuracy 
## -0.1700000  0.5801527
bing_threshold
##  threshold   accuracy 
## -0.3500000  0.5954198
afinn_threshold
##  threshold   accuracy 
## -0.6900000  0.5839695
nrc_threshold
##  threshold   accuracy 
## -0.2700000  0.5839695

From the results, we can see that the prediction accuracy of the four methods is roughly the same, around 58%, which is not very high. There are several possible explanations:

  1. Due to Twitter API crawling constraints, we were only allowed to crawl the most recent 3000 tweets from each account. Thus, for 2014-2015, we could not capture a comprehensive set of tweets.
  2. The dictionaries provided by the syuzhet package might not correctly capture the true sentiment of finance tweets, which often contain specialised jargon.
  3. Using weekly average sentiment values to predict weekly returns may lower the prediction accuracy, because averaging over daily sentiment values smooths away information. It would likely be more accurate to predict daily S&P 500 movements using daily average sentiment values.

Nevertheless, despite the modest accuracy, our text analytics approach provides hedge funds and other asset management companies with a general methodology for leveraging Twitter sentiment to aid investment decisions.
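As a sketch of explanation 3 above, the same thresholding idea could be applied at the daily frequency. The snippet below is illustrative only: it assumes a hypothetical daily return file SPXDailyReturn.csv (with columns Date and Return), which is not part of our current data set.

```r
#daily average sentiment, using the syuzhet scores already computed
daily_sentiment <- aggregate(syuzhet ~ date, data = sentiment_values, FUN = mean)

#hypothetical daily S&P 500 returns (file and column names are assumptions)
daily_return <- read.csv("SPXDailyReturn.csv")
daily <- merge(daily_sentiment, daily_return, by.x = "date", by.y = "Date")

#reuse the thresholding rule at the daily frequency
table(actual = daily$Return > 0,
      predicted = daily$syuzhet > syuzhet_threshold["threshold"])
```

A threshold search analogous to findThreshold could then be run on this daily confusion matrix.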

Regression

Regression is carried out using weekly spx prices from Jan 2014 to Dec 2018 as the dependent variable. The explanatory variables are:

  1. Market & S&P Portfolio Indices: S&P 500 price-to-earnings ratio (p/e), the volatility index (vix), put-to-call ratio (put/call)
  2. Technical Analysis Indicators: Z-score (Zscore), Relative Strength Index (rsi), volume-weighted average price (vwap)
  3. Fama-French 3-Factor Model Variables: excess market return (excess.mkt.return), Small Minus Big (smb), High Minus Low (hml), risk-free rate (rf)
  4. Macroeconomic Indicators: US gross domestic product (gdp), inflation rate, unemployment rate

Data cleaning is first carried out to prepare for regression later.

data<-read.csv("RegressionData.csv")

colnames(data) <-c("date","spx","p/e","vix","put/call","Zscore","rsi","vwap","excess.mkt.return","smb","hml","rf","gdp","inflation","unemployment")
data<-na.omit(data)

#format first column to proper date format
data$date<-as.Date(data$date,origin="1899-12-30")
data$gdp<-as.numeric(data$gdp)
data$inflation<-as.numeric(data$inflation)
data$unemployment<-as.numeric(data$unemployment)

summary(data)#check data type
##       date                 spx            p/e             vix       
##  Min.   :2014-02-07   Min.   :1797   Min.   :15.83   Min.   : 9.14  
##  1st Qu.:2015-04-29   1st Qu.:2033   1st Qu.:18.10   1st Qu.:11.97  
##  Median :2016-07-18   Median :2127   Median :19.30   Median :13.53  
##  Mean   :2016-07-18   Mean   :2264   Mean   :19.43   Mean   :14.77  
##  3rd Qu.:2017-10-07   3rd Qu.:2507   3rd Qu.:20.81   3rd Qu.:16.27  
##  Max.   :2018-12-28   Max.   :2930   Max.   :23.39   Max.   :30.11  
##     put/call         Zscore             rsi             vwap     
##  Min.   :1.036   Min.   :-3.1940   Min.   :28.15   Min.   :1787  
##  1st Qu.:1.608   1st Qu.: 0.1802   1st Qu.:42.51   1st Qu.:2002  
##  Median :1.900   Median : 1.2391   Median :51.57   Median :2098  
##  Mean   :1.894   Mean   : 0.9906   Mean   :51.65   Mean   :2229  
##  3rd Qu.:2.134   3rd Qu.: 2.0707   3rd Qu.:59.24   3rd Qu.:2461  
##  Max.   :3.417   Max.   : 4.2366   Max.   :82.45   Max.   :2826  
##  excess.mkt.return      smb               hml                rf         
##  Min.   :-7.3300   Min.   :-3.1700   Min.   :-3.8400   Min.   :0.00000  
##  1st Qu.:-0.6950   1st Qu.:-0.7450   1st Qu.:-0.8200   1st Qu.:0.00000  
##  Median : 0.3150   Median : 0.0150   Median :-0.1550   Median :0.00500  
##  Mean   : 0.1669   Mean   :-0.0491   Mean   :-0.0459   Mean   :0.01196  
##  3rd Qu.: 1.2825   3rd Qu.: 0.6225   3rd Qu.: 0.6025   3rd Qu.:0.02100  
##  Max.   : 4.7800   Max.   : 6.0700   Max.   : 3.9100   Max.   :0.04800  
##       gdp              inflation           unemployment    
##  Min.   :-0.007330   Min.   :-0.0002501   Min.   :0.03700  
##  1st Qu.: 0.003358   1st Qu.: 0.0002499   1st Qu.:0.04275  
##  Median : 0.005701   Median : 0.0002499   Median :0.04900  
##  Mean   : 0.005550   Mean   : 0.0003133   Mean   :0.04939  
##  3rd Qu.: 0.008211   3rd Qu.: 0.0004996   3rd Qu.:0.05500  
##  Max.   : 0.012272   Max.   : 0.0007492   Max.   :0.06700

A simple plot is produced to get a rough idea of what the dependent variable, the S&P 500 index value (denoted spx), looks like.

library(ggplot2)
ggplot(data, aes(x=date, y=spx)) + geom_line()

Further data cleaning is carried out to remove variables that exhibit significant correlation with other variables, in order to reduce inaccuracies arising from multicollinearity.

library(PerformanceAnalytics)
data2<-data[,-1] #remove the first column "date", leaving only numeric values
#summary(data2)
chart.Correlation(data2,histogram=TRUE,pch=19) #study the correlation between column variables

#vwap and zscore show significant correlation with a few other variables; they are removed from the model
data2<-data2[,-c(5,7)]#remove columns with high correlation
data4 <- data2

Generalised Linear Regression (Model 1)

A simple linear model is first developed. 80% of the data is used as the training set, while the remaining 20% is held out as the test set.

train <- floor(0.8*nrow(data2))

data2.x <- model.matrix(spx~., data = data2)
data2.x <- data2.x[1:train,-1]
data2.y <- data2$spx[1:train]

The LOOCV (leave-one-out cross-validation) method is applied to the training set.

library(broom)
library(tidyverse)
library(HDtweedie)
library(glmnet)
library(MASS)
library(car)
library(surveillance)
Loocv.lasso.data2 <- cv.glmnet(x = data2.x, y = data2.y, alpha = 1, nfolds = nrow(data2.x))
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
plot(Loocv.lasso.data2)

This graph shows the MSE for different lambda values. For parsimony, we will use the largest lambda value whose CV error is within 1 standard error of the best model (lambda.1se).

glance(Loocv.lasso.data2)
## # A tibble: 1 x 2
##   lambda.min lambda.1se
##        <dbl>      <dbl>
## 1      0.253       1.96

We now fit the model with the selected lambda, generate projected spx values, and later construct a confusion matrix to find the accuracy rate of the linear model.

lambda1se.data2 <- Loocv.lasso.data2$lambda.1se
lasso.lambda1se <- glmnet(x = data2.x, y = data2.y, alpha = 1, lambda = lambda1se.data2)
xset.lambda1se <- tidy(lasso.lambda1se)$term
xset.lambda1se <- xset.lambda1se[-1]

f1 <- as.formula(paste("spx ~", paste(xset.lambda1se, collapse = " + ")))
model1.1se <- glm(f1, data = data2)
pchisq(glance(model1.1se)$deviance, glance(model1.1se)$df.residual,lower.tail = F)
## [1] 0

Running the chi-squared test generates a p-value of approximately 0, rejecting the null hypothesis that the model has no explanatory power. This suggests that the model is able to explain the movement in spx.

The model summary for the linear model is given below:

summary(model1.1se)
## 
## Call:
## glm(formula = f1, data = data2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -140.151   -34.612    -6.706    36.865   130.998  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1202.599    162.039   7.422 1.90e-12 ***
## `p/e`                52.643      4.728  11.135  < 2e-16 ***
## vix                  -6.666      1.423  -4.684 4.67e-06 ***
## `put/call`           40.087     10.647   3.765 0.000209 ***
## excess.mkt.return     4.338      2.591   1.674 0.095375 .  
## smb                  -3.599      3.165  -1.137 0.256635    
## hml                  -5.543      3.019  -1.836 0.067605 .  
## rf                15539.121    529.724  29.334  < 2e-16 ***
## gdp                5302.923   1062.736   4.990 1.15e-06 ***
## inflation         59867.855  19601.733   3.054 0.002506 ** 
## unemployment      -3520.919   1221.542  -2.882 0.004298 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3052.544)
## 
##     Null deviance: 24459856  on 255  degrees of freedom
## Residual deviance:   747873  on 245  degrees of freedom
## AIC: 2793.3
## 
## Number of Fisher Scoring iterations: 2

Plotting the predicted spx values against the residuals (actual minus predicted spx values).

Model1withRes <- augment(model1.1se)

pred.data2 <- predict(model1.1se, data2[(train+1):nrow(data2),])
res.data2 <- data2$spx[(train+1):nrow(data2)]-pred.data2
plot(pred.data2, res.data2, type="p")
abline(h=0, col="red")

The plot shows that the residual appears randomly scattered and uncorrelated with the predicted values.

Plotting the standardized residuals

plot(x = 1:nrow(data2), y = Model1withRes$.std.resid,
     xlab = "observation ID", ylab = "std residuals",
     ylim = c(-5,5),
     main = "Standardized Residuals")
abline(h = c(-2,2), lty = 3, col = "blue")

The plot shows that the vast majority of the points are located within 2 standard deviations from 0, validating that the model does not show significant abnormality.

Plotting the Anscombe residuals

#Anscombe residuals
qqnorm(surveillance::anscombe.residuals(model1.1se, 1), 
       xlim = c(-3,3), ylim = c(-3,3), main = "Anscombe Residuals")
abline(a = 0, b = 1, lty = 2, col = "red")

The quantile-quantile plot shows that the points fall very close to the 45-degree line, suggesting the residuals are approximately normally distributed and supporting that the model is a good fit.

Examination of Model Accuracy

We generate the confusion matrix and calculate the prediction accuracy. Each predicted spx value is compared against the previous week:

  - an increase of more than 100 basis points (1%) is defined as a bull week, denoted 1
  - a decrease of more than 100 basis points (1%) is defined as a bear week, denoted -1
  - otherwise it is a base week, denoted 0

pred.model1 <- predict(model1.1se, data2)
n <- length(pred.model1)
classification <- data.frame(change = pred.model1[2:n]/pred.model1[1:(n-1)] - 1)
classification$actualchange <- data2$spx[2:n]/data2$spx[1:(n-1)] - 1

#vectorised classification into bull (1), bear (-1) and base (0) weeks
classification$result <- ifelse(classification$change > 0.01, 1,
                                ifelse(classification$change < -0.01, -1, 0))
classification$actual <- ifelse(classification$actualchange > 0.01, 1,
                                ifelse(classification$actualchange < -0.01, -1, 0))

confusion1<-table(classification$actual,classification$result)#plot confusion matrix
accuracy1<- (confusion1[1,1]+confusion1[2,2]+confusion1[3,3])/sum(confusion1) #calculate accuracy rate for model1
confusion1
##     
##      -1  0  1
##   -1 41 29  3
##   0  27 82 23
##   1   1 14 35
cat("accuracy for model1 is ",accuracy1,".")
## accuracy for model1 is  0.6196078 .

We can see that the model produces a decent accuracy rate of approximately 62%.

Generalised Linear Model with Interaction Terms (Model 2)

To investigate whether this linear model could be improved further, interaction terms of the explanatory variables are also included in the analysis below.

data4.x <- model.matrix(spx~.^2, data = data4)
data4.x <- data4.x[1:train,-1]
data4.y <- data4$spx[1:train]

Loocv.lasso.data4 <- cv.glmnet(x = data4.x, y = data4.y, alpha = 1, nfolds = nrow(data4.x))
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
plot(Loocv.lasso.data4)

glance(Loocv.lasso.data4)
## # A tibble: 1 x 2
##   lambda.min lambda.1se
##        <dbl>      <dbl>
## 1      0.137      0.217

To allow a fair comparison with the first linear model, the lambda value whose CV error is within 1 standard error of the best model is again selected, as in model 1.

lambda1se.data4 <- Loocv.lasso.data4$lambda.1se
lasso.lambda1se4 <- glmnet(x = data4.x, y = data4.y, alpha = 1, lambda = lambda1se.data4)

xset.lambda1se4 <- tidy(lasso.lambda1se4)$term
xset.lambda1se4 <- xset.lambda1se4[-1]

f3 <- as.formula(paste("spx ~", paste(xset.lambda1se4, collapse = " + ")))
model4.1se <- glm(f3, data = data2)
pchisq(glance(model4.1se)$deviance, glance(model4.1se)$df.residual,lower.tail = F)
## [1] 0
summary(model4.1se)
## 
## Call:
## glm(formula = f3, data = data2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -203.367   -37.614    -6.614    33.752   250.385  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.924e+03  5.956e+01  49.090  < 2e-16 ***
## vix               -1.845e+01  1.165e+00 -15.844  < 2e-16 ***
## `put/call`         4.468e+01  1.303e+01   3.429 0.000709 ***
## excess.mkt.return -1.871e-01  3.134e+00  -0.060 0.952437    
## smb               -1.291e+00  3.868e+00  -0.334 0.738902    
## hml               -5.530e+00  3.698e+00  -1.496 0.136035    
## rf                 1.243e+04  5.516e+02  22.542  < 2e-16 ***
## gdp                7.684e+03  1.275e+03   6.027 6.06e-09 ***
## inflation          1.032e+05  2.353e+04   4.385 1.72e-05 ***
## unemployment      -1.409e+04  9.424e+02 -14.947  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4578.749)
## 
##     Null deviance: 24459856  on 255  degrees of freedom
## Residual deviance:  1126372  on 246  degrees of freedom
## AIC: 2896.2
## 
## Number of Fisher Scoring iterations: 2

The p-value for the chi-squared test is also approximately 0, rejecting the null hypothesis that this model has no explanatory power.

The various plots below also show that the residuals do not exhibit any significant irregularity. However, the graphs appear less ideal than those from model 1. For instance, the Anscombe residual plot shows some deviation from the 45-degree line at the low and high quantiles, indicating non-normal behaviour of the residuals.

Model1withRes4 <- augment(model4.1se)

pred.data4 <- predict(model4.1se, data2[(train+1):nrow(data2),])
res.data4 <- data2$spx[(train+1):nrow(data2)]-pred.data4
plot(pred.data4, res.data4, type="p")
abline(h=0, col="red")

#Standardized residuals
plot(x = 1:nrow(data2), y = Model1withRes4$.std.resid,
     xlab = "observation ID", ylab = "std residuals",
     ylim = c(-5,5),
     main = "Standardized Residuals")
abline(h = c(-2,2), lty = 3, col = "blue")

#Anscombe residuals
qqnorm(surveillance::anscombe.residuals(model4.1se, 1), 
       xlim = c(-3,3), ylim = c(-3,3), main = "Anscombe Residuals")
abline(a = 0, b = 1, lty = 2, col = "red")

Examination of Model Accuracy

pred.model4 <- predict(model4.1se, data2)
n4 <- length(pred.model4)
classification$model4 <- pred.model4[2:n4]/pred.model4[1:(n4-1)] - 1

#vectorised classification for the model with interaction terms
classification$result4 <- ifelse(classification$model4 > 0.01, 1,
                                 ifelse(classification$model4 < -0.01, -1, 0))

confusion4<-table(classification$actual,classification$result4)#confusion matrix for model 2
accuracy4<- (confusion4[1,1]+confusion4[2,2]+confusion4[3,3])/sum(confusion4) #calculate accuracy rate for model 2
confusion4
##     
##      -1  0  1
##   -1 41 27  5
##   0  36 71 25
##   1   2 15 33
cat("accuracy for model4 is ",accuracy4,".")
## accuracy for model4 is  0.5686275 .

Despite including interaction terms of the existing explanatory variables, the accuracy is slightly lower than that of model 1, at 57%.

Generalised Logistic Model Using LOOCV (Model 3)

We proceed to investigate whether a logistic model could have better predictive power than the linear models. Preparation for the logistic model includes defining the dependent variable, the spx movement, as 3 distinct levels: bull, base and bear, similar to above.

data3<-data2[-1,]
data3$movement<- classification$actual

#loocv variable selection
data3.x <- model.matrix(movement~., data = data3)
data3.x <- data3.x[1:train,-c(1,13)]
data3.y <- data3$movement[1:train]


Loocv.lasso.data3 <- cv.glmnet(x = data3.x, y = data3.y, alpha = 1, nfolds = nrow(data3.x), family="multinomial")
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
plot(Loocv.lasso.data3)

The lambda value that gives the lowest cross-validated error (lambda.min) is selected.

glance(Loocv.lasso.data3)
## # A tibble: 1 x 2
##   lambda.min lambda.1se
##        <dbl>      <dbl>
## 1     0.0375      0.138
lambdamin.data3<-Loocv.lasso.data3$lambda.min


model2<-glmnet(data3.x,data3.y,alpha=1, family="multinomial",lambda=lambdamin.data3)
lasso.lambdamin2 <- glmnet(x = data3.x, y = data3.y, alpha = 1, lambda = lambdamin.data3)
xset.lambdamin2 <- tidy(lasso.lambdamin2)$term
xset.lambdamin2 <- xset.lambdamin2[-1]

f2 <- as.formula(paste("movement ~", paste(xset.lambdamin2, collapse = " + ")))

model2.min <- glm(f2, data = as.data.frame(data3[1:train,]))
summary(model2.min)
## 
## Call:
## glm(formula = f2, data = as.data.frame(data3[1:train, ]))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.21025  -0.61043   0.03294   0.19798   1.39619  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       -0.3348761  0.4850832  -0.690   0.4908  
## spx                0.0001696  0.0001659   1.023   0.3077  
## vix               -0.0103469  0.0128159  -0.807   0.4204  
## excess.mkt.return  0.0446492  0.0298041   1.498   0.1357  
## smb               -0.0964857  0.0418738  -2.304   0.0222 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.444156)
## 
##     Null deviance: 93.583  on 203  degrees of freedom
## Residual deviance: 88.387  on 199  degrees of freedom
## AIC: 420.3
## 
## Number of Fisher Scoring iterations: 2

It can be observed that most of the explanatory variables become insignificant in this model. This suggests that logistic regression may not be the best approach to model the relationship between the spx movement levels and the explanatory variables. However, for comparison and illustration purposes, we will still carry on with this model and examine its prediction accuracy.

Examination of Model Accuracy

pred.model2 <- predict(model2.min, data3)

classification2 <- data.frame(actual = data3$movement)
#vectorised classification of the model's predictions
classification2$result <- ifelse(pred.model2 > 0.01, 1,
                                 ifelse(pred.model2 < -0.01, -1, 0))

confusion2<-table(classification2$actual,classification2$result)#confusion matrix for model 3
accuracy2<- (confusion2[1,1]+confusion2[2,2]+confusion2[3,3])/sum(confusion2) #calculate accuracy rate for model 3
confusion2
##     
##      -1  0  1
##   -1 61  3  9
##   0  84  8 40
##   1  34  2 14
cat("accuracy for model2 is ",accuracy2,".")
## accuracy for model2 is  0.3254902 .

The results show that the accuracy rate, at less than 50%, is significantly lower than that of the linear models. This echoes the observation that the logistic model has relatively weak explanatory power in projecting spx prices.

Overall, model 1 produces better diagnostic plots as well as a higher accuracy rate, so it is by far the best model for projecting future spx prices.

Simulation

A simulation is carried out using Model 1, which has the best model accuracy among the 3 regression models. To model all the input variables, distribution fitting is carried out to find out the most suitable distributions.

A simple illustration is included below using the variable vix.

library(fitdistrplus)
#vix appears right-skewed from correlation matrix plot at the start
#finding the most suitable distribution for modelling vix
norm <- fitdist(data2$vix, "norm")
gamma <- fitdist(data2$vix, "gamma")
lnorm <- fitdist(data2$vix,"lnorm")
plot.legend <- c("Normal", "Gamma", "Lognormal")
denscomp(list(norm, gamma, lnorm), legendtext = plot.legend)

The density plot seems to suggest that lognormal is the most suitable distribution to model vix. However, a more scientific approach is employed by selecting the distribution that gives the lowest AIC value.

which.min(c(norm$aic, gamma$aic, lnorm$aic))
## [1] 3

The lognormal distribution is finally selected as it yields the minimum AIC value among the 3 candidates.
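For reference, the fitted parameters live in the `estimate` component of the fitdist object; the sketch below shows how they can be inspected and used to draw simulated vix values (the `meanlog`/`sdlog` names follow the fitdistrplus convention for lognormal fits).

```r
fit.vix <- fitdist(data2$vix, "lnorm")
fit.vix$estimate   #named vector with components meanlog and sdlog

#draw one simulated vix observation from the fitted lognormal distribution
rlnorm(1, meanlog = fit.vix$estimate["meanlog"], sdlog = fit.vix$estimate["sdlog"])
```

The same pattern (fit, read off `estimate`, sample) is used for each input variable in the simulation below.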

The simulation model is built to project the spx price over the first 12 weeks of 2019 (as the input data used for modelling ends at the last week of 2018). It is an iterative simulation: input variables for one week are generated by sampling from the chosen distributions, and the sampled data are appended to the input data to generate the input values for the following week, until the input variables for the 12th week have been generated. An estimated spx price is computed for each week and the week-on-week price change is calculated.

The user is expected to provide the simulation inputs, including the number of iterations to run, the simulation period (in weeks) and the cutoff of spx price change the user wishes to track. For instance, should the user wish to know the probability that the spx price will rise by more than 2% in a week, 0.02 is supplied as the last simulation input.

library(fitdistrplus)
simulate.input<-data2

#default simulation period as 12 weeks (1 quarter)
iteration<-function(iter,periods = 12,cutoff = 0.02){
  predicted <- c()
  
  for(i in 1:periods){
    #note: fitdist stores the fitted parameters in $estimate
    #($sd holds their standard errors, not the distribution's parameters)
    fit.vix <- fitdist(simulate.input$vix,"lnorm")
    vix.sample <- rlnorm(1,meanlog = fit.vix$estimate["meanlog"],sdlog = fit.vix$estimate["sdlog"])
    
    fit.pe <- fitdist(simulate.input$`p/e`,"norm")
    pe.sample <- rnorm(1,fit.pe$estimate["mean"],fit.pe$estimate["sd"])
    
    fit.put.call <- fitdist(simulate.input$`put/call`,"norm")
    put.call.sample <- rnorm(1,fit.put.call$estimate["mean"],fit.put.call$estimate["sd"])
    
    fit.excess.mkt <- fitdist(simulate.input$excess.mkt.return,"norm")
    excess.mkt.sample <- rnorm(1,fit.excess.mkt$estimate["mean"],fit.excess.mkt$estimate["sd"])
    
    fit.smb <- fitdist(simulate.input$smb,"norm")
    smb.sample <- rnorm(1,fit.smb$estimate["mean"],fit.smb$estimate["sd"])
    
    hml.norm<-fitdist(simulate.input$hml,"norm")
    hml.sample<-rnorm(1,hml.norm$estimate[1],hml.norm$estimate[2])
    
    gdp.norm<-fitdist(simulate.input$gdp,"norm")
    gdp.sample<-rnorm(1,gdp.norm$estimate[1],gdp.norm$estimate[2])
    
    #unemployment, risk-free rate and inflation are assumed to stay constant over a quarter
    unemployment.sample<-0.037
    rf.sample<-0.0048
    inflation.sample<-0.00025
    
    #spx (1000) and rsi (52.536) are placeholders; neither was selected into model 1's formula
    simulate.input<-rbind(simulate.input,c(1000,pe.sample,vix.sample,put.call.sample,52.536,excess.mkt.sample,smb.sample,hml.sample,rf.sample,gdp.sample,inflation.sample,unemployment.sample))
    predicted[i] <- predict(model1.1se,simulate.input[nrow(simulate.input),])
  }
  change <- predicted[2:periods]/predicted[1:(periods-1)] -1
  boom <- ifelse(sum(change > cutoff)>0,1,0)
  return(boom)
}

simulate<-function(iter,periods = 12,cutoff = 0.05){
  result<-sapply(1:iter,iteration,periods,cutoff)
  prob <- sum(result)/iter
  cat("The probability of getting a market increase of ",cutoff," is ",prob,".\n")
}

simulate(10,12,0.025) #iterations reduced to speed up knitting
## The probability of getting a market increase of  0.025  is  0.3 .

The results show that the probability of at least one week-on-week increase of more than 2.5% (250 basis points) in the spx price over the quarter is approximately 20% to 30%, a bullish signal for an investor buying spx now (at the start of 2019). This is consistent with the actual spx price trend in Quarter 1 2019, in which 3 out of 12 weeks indeed showed a week-on-week increase of more than 2.5%.

Conclusion

In conclusion, we have attempted to apply PCA, Text Analytics, Regression and Simulation to forecast the US stock market performance in 2019. The stock market performance is tracked using a proxy, the S&P 500 index. Nowcasting GDP using the PCA approach did not generate satisfactory accuracy, but it remains an interesting experiment to project numerical values using principal component analysis.

Modelling of the stock market performance using text analytics and regression generates similar accuracy rates of approximately 60%. Text analytics helps investors gauge the general sentiment among participants of the financial market, a qualitative measure that is hard to capture with numerical indicators. In comparison, the regression analysis utilises quantitative measures to project the actual level the spx price could reach, and it predicts Q1 2019 to be bullish, which tallies with the actual movement of spx.

We are also aware that both approaches face limitations that adversely affect the applicability and accuracy of the results. Text analytics is constrained by the time horizon of data collection and by the limited dictionary coverage of financial jargon. The regression model, on the other hand, fails to adequately capture the clustering nature of financial indices, as market indicators and prices tend to be closer to each other in neighbouring periods than in periods further apart. The regression model is also unable to predict the possibility of “black swan events”, sudden market crashes fuelled by extreme sentiment such as panic; text analytics might be more helpful in capturing these events.

Overall, the results generated are far from ideal, but employing these methods can help investors gauge the likely direction of the stock market so that they can frame their investment strategies accordingly.